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Abstract 



In this article, an item response (IRT) model is used as a measurement error model 
for the dependent variable of a multilevel model where tests or questionnaires consisting of 
separate items are used to perform a measurement error analysis. The advantage of using latent 
scores as dependent variables of a multilevel model is that it offers the possibility of modeling 
response variation and measurement error and separating the influence of item difficulty and 
ability level. The two-parameter normal ogive model is used for the IRT model. It will be shown 
that the stochastic EM (SEM) algorithm can be used to estimate the parameters which are close 
to the maximum likelihood estimates. It turns out that this algorithm is easily implemented. 
This estimation procedure will be compared to an implementation of the Gibbs sampler in a 
Bayesian framework. Examples using real data are given. 

Key words: Bayes estimates, Data Augmentation, Gibbs sampler, item response 
theory, Markov chain Monte Carlo, multilevel model, stochastic EM, two-parameter normal 
ogive model. 
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Introduction 

Many data in educational science have a hierarchical or clustered structure. For 
example, in schooling systems students are nested within schools. Information relevant to 
educational outcomes is inherently multilevel or hierarchical in nature. To properly understand 
educational phenomena relevant to schooling, it is important to work with multilevel models 
that explicitly take this hierarchical organization into account. Therefore, multilevel analysis 
is a common way for properly analyzing such data (Bryk & Raudenbush, 1992; Goldstein, 
1995). Furthermore, multilevel analysis makes it possible to compare schools in terms of the 
achievements of their students and factors can be studied that explain school differences. 

In Fox and Glas (2000), a multilevel IRT model is proposed to model such data 
and a latent variable is used as outcome in the multilevel analysis. This approach takes into 
account that, for example in school effectiveness research, the students’ abilities are latent 
variables measured with error. The responses to the items of a test or questionnaire are viewed 
as multiple discrete and fallible indicators of the latent dependent variable and the relation 
between the observed indicators and the latent variable is modeled by an item response theory 
(IRT) model. This approach has the advantage that it is no longer assumed that the error 
component is independent of the outcome variable, i.e., the score of the test taker. In IRT, 
measurement error can be defined locally, for instance, as the variance of the ability parameter 
given a response pattern. This local definition of measurement error results in hetroscedasticity. 
Another advantage of the IRT approach is that, contrary to observed scores, latent scores are 
test-independent, which offers the possibility of analyzing data from incomplete designs, such 
as, for instance, matrix-sampled educational assessments, where different (groups of) persons 
respond to different (sets of) items. 

In the field of IRT models some applications of the multilevel model can be found. 
Adams, Wilson and Wu (1997) discuss the treatment of latent variables as outcomes in a 
regression analysis. They show that a regression model on latent proficiency variables can 
be viewed as a two-level model where the first level consists of the item response measurement 
model which serves as a within-student model and the second level consists of a model 
on the student population distribution, which serves as a between-students model. Further, 
Adams, Wilson and Wu (1997) show that this approach results in an appropriate treatment 
of measurement error in the dependent variable of the regression model. Raudenbush and 
Sampson (1999) embedded the Rasch model within a three-level hierarchical regression model, 
that is, the Level 1 model consists of the predictable and random variation among item 
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responses within each group. Another application of multilevel modeling in the framework 
of IRT models was given by Mislevy and Bock (1989) where group-level and student-level 
effects are combined in an hierarchical IRT model. Finally, Patz and Junker (1999) developed 
a generic hierarchical item response model which allows covariates on subjects and covariates 
on items. 

In Fox and Glas (2000), a fully Bayesian estimation procedure is described, and where 
a Markov chain Monte Carlo method (Gibbs sampler) is used for concurrently estimating 
all parameters. The fully conditional decomposition of Gelfand and Smith’s (1990) Gibbs 
sampling produces an approximation for the posterior distributions of the parameters. That 
is, the Gibbs sampler is used to find the mode of the posterior distribution in a Bayesian 
framework, taking account of all sources of uncertainty in the estimation of the parameters. 
In the present paper, the Bayes estimator will be compared to an approximate maximum 
likelihood estimator. Specific properties of maximum likelihood estimates can be found in, for 
example, Lehmann and Casella (1998) and Rao (1973). Besides, the likelihood of the sample 
of observations represented by the data is maximized without any prior knowledge regarding 
the parameters of interest. 

The likelihood function is complex due to the presence of some nuisance parameters. 
Maximizing the likelihood directly is often numerically infeasible. The idea is to view the 
nuisance parameters as unobserved data, and to associate with the given incomplete-data 
problem a complete-data problem for which maximum likelihood estimation is feasible. That 
is, the problem of maximizing the likelihood is reformulated in such a way that the maximum 
likelihood estimates are more easily computed from a complete-data likelihood. The stochastic 
EM (SEM) algorithm is particularly appealing in situations where inference on complete-data 
is easy. The algorithm handles complex missing-data structures in which high-dimensional 
integrations over the nuisance parameters may be involved. It imputes values for the missing 
data and then iteratively performs direct parametric inference based on the complete-data. This 
makes it attractive for estimating the multilevel IRT model with latent variables defined by a 
complex structural model. Moreover, the parameter estimates resulting from the algorithm are 
close to the maximum likelihood estimates. Further applications of the SEM algorithm can be 
found in, e.g., Celeux and Diebolt (1985), Celeux et al. (1996), Diebolt and Ip (1996) and Ip 
(1994). 

In the first section of this paper, the notation and a general multilevel IRT model is 
presented. Next, the principles of SEM and the implementation for estimating the parameters of 
a multilevel IRT model are described. Furthermore, a parallel will be drawn between parameter 
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estimation with SEM and Markov chain Monte Carlo (Gibbs sampler). After that, a Dutch 
primary language test will be analyzed and the mentioned estimators will be compared. Finally, 
the last section contains a discussion and suggestions for further research. 

A Multilevel IRT Model 

This section contains the basic principles and formulae of a multilevel IRT model. For 
a detailed introduction of the model, see Fox and Glas (2000). In its general form. Level 1 
of the two level multilevel model consists of a regression model, for each of J nesting Level 
2 groups, j = 1, . . . , J, in which the nj x 1 ability vector 0j is modeled as a function of Q 
predictor variables, that is, 



Oj — X-jflj + ej, 0 ) 

where Xj is an rijXQ matrix of observed predictors and ej is an rij x 1 vector of residuals, that 
are assumed to be normally distributed with mean 0 and variance <r 2 I nj . All Q + 1 regression 
parameters, P 0j , ■ . ■ ,P Qj , are treated as varying across Level 2, although it is possible to 
constrain the variation in one or more parameters to zero. The random regression parameters 
are treated as outcomes in a Level 2 model 



Pj — Wj 7 + Uj, ( 2 ) 

where Uj is a vector of random effects assumed normally distributed with mean zero and 
covariance T, Wj is a matrix consisting of Level 2 characteristics and 7 is a S x 1 vector 
of fixed effects. 

Suppose each of J2j n j persons, labeled i = 1,... ,rij, j = 1,... , J, respond 
to K items, labeled k = 1, . . . , K. A binary response Y ijk = 1 or 0 is recorded. 
Furthermore it is assumed that, conditionally on the item and population parameters, the 
responses { Yj k } are independent Bernoulli random variables, with probability of success 
Pijk = P ( Yijk = 1 | Qij,ak, b k ) . The normal ogive model is used to model the {p i:j k} ■ This 
leads to, 



Pijk — ^ ) 



( 3 ) 
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where <F denotes the standard normal cumulative distribution function. Below, the parameters 
of item k will also be denoted by £*, £ k = (a k , b k f. Notice, the item difficulty is denoted by 
the usual choice b while regression coefficients are denoted by /?. The two parameter model 
has a discrimination parameter a k for each item k = 1 , ,K. The restrictions a k > 0, 
k = 1, . . . , K, assure that a student, indexed ij, with a higher ability 6ij has a higher probability 
of getting item k correct. 

To model guessing in a multiple choice test another set of parameters, the guessing 
parameters, are introduced in the so called three parameter model. The probability that a 
student correctly answers an item, indexed k, is represented as the sum of the probabilities 
that the student guesses and gets the item correct, c k , plus the probability that the student does 
not guess, (1 - c k ) , and gets the item correct, <f> ( a k 0ij — b k ); that is, 

P (Y ijk = 1 | e tj , a k , b k , c k ) =c k + (l- c k ) <f> (a k dij - b k ) . (4) 

An elaborate description of both models can be found in the pioneering work of Bimbaum 
(1968) and Lord (1980). Discussions and literature reviews are found in Johnson and Albert 
(1999) and van der Linden and Hambleton (1997). 

Formulae (1) and (2) define the structural model and formula (3) or (4) the 
measurement model. Jointly, this defines a multilevel IRT model which will be estimated using 
SEM. 



The SEM Algorithm 

The EM (expectation-maximization) algorithm is a well-known approach for 
computing maximum likelihood estimates in a wide variety of situations (see, Dempster et 
al., 1977). Notably, many incomplete data problems can be handled with the EM algorithm. 
Also the estimation of latent variable models and random parameter models is supported by 
the EM algorithm when they are formulated as missing value problems. In spite of its many 
appealing features, the EM algorithm has several drawbacks. For example, it can converge 
to local maxima or saddle points of the log-likelihood function and its limiting position is 
often sensitive to starting values. In some models, the computation of the E-step involves high 
dimensional integrations. Therefore, the E-step can be computationally difficult. 

The SEM algorithm (Celeux & Diebolt, 1985) provides an alternative to the EM 
approach. Particularly in situations where inference based on complete data is easy, but also 
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in cases where the EM approach is intractable or where the E-step involves high dimensional 
integrations. 

The basic idea underlying the SEM algorithm is to impute missing data with plausible 
values and then update parameters on the basis of the complete-data. The SEM algorithm 
consists of two steps. The S-step generates a complete-data sample by drawing missing data, 
given the observed data and a current estimate of the parameters. In the M-step, the maximum 
likelihood estimate of the parameters is computed based on the complete-data. The entire 
procedure is iterated a sufficient number of times. 

Under specific conditions, the array of estimates corresponding to each draw of 
pseudo-complete data forms a Markov chain that converges to a stationary distribution (Ip, 
1994). The mean of this stationary distribution is close to the maximum likelihood estimate 
and its variance reflects the information loss due to missing data (Diebolt & Ip, 1996). 

Let Y be the observed random sample. The values of the Level 1 and Level 2 
explanatory variables are known. They are denoted as X and W, respectively. The model 
has parameters 0 , £, Level 1 regression coefficients (3, Level 2 regression coefficients 7 and 
variance components a 2 and T. The observed or incomplete-data likelihood of the parameters 
of interest is given by 

Z(£,<T 2 ) 7 ) T;Y) = n/ 

(5) 

where p (y^- | 0y ,£) is the IRT model specifying the probability of the observing response 
pattern y y as a function of the ability parameter 0^- and the item parameters £. Further, 
g [6ij | (3j,cr 2 ) is the density of 0y and h (/3^ | 7 ,T) is the density of f3j. The marginal 
likelihood entails a multiple integral over 0y and Computation of two-dimensional integrals 
suffices. An EM algorithm is easily implemented in case all discrimination parameters are 
equal, that is, in case the measurement error model is the Rasch model (Raudenbush & 
Sampson, 1999). The probability model is then a member of the regular exponential family 
of distributions. A lesser restrictive IRT model, where the discrimination parameters may 
differ per item, is wider applicable but estimating the parameters becomes more difficult. 
This problem of integration and maximization relates to the estimation of a random-effects 
model for ordinal data and to the full information factor analysis model (Anderson, 1985; 
Gibbons & Bock, 1987; Gibbons & Hedeker, 1992; Hedeker & Gibbons, 1994). In the 
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case of a bi-factor model, Hedeker and Gibbons (1994) utilized Gauss-Hermite quadrature 
to numerically integrate over the distribution of random effects. Fisher’s method was used to 
provide the solution to the likelihood equation. The numerical integration is feasible in these 
two-dimensional problems. However, if the number of dimensions is increased, using Gauss- 
Hermite quadrature is no longer feasible. 

An alternative approach is the stochastic EM algorithm which can handle these 
problems and also further development of the multilevel model to three or more levels and more 
complex IRT models. The likelihood should be defined as a function of the complete-data such 
that a simpler likelihood maximization can be performed. This approach follows the procedure 
of Albert (1992) and Johnson and Albert (1999). Assume that there exists a continuous latent 
variable that underlies each binary response. The latent variables 0jj are related to the observed 
responses, Y ijk , of a person, indexed ij, on a item, indexed k. This observation Y ijk can be 
interpreted as an indicator that a continuous variable with normal density is above or below 
zero. This variable is denoted as Zij k . It follows that 



with Eij k ~ N (0, 1) and Y ijk = I (Z ljk > 0) . Here, I (.) is an indicator variable taking the 
value one if its argument is true and zero otherwise. The latent variable structure yields a 
model that is equivalent to the normal ogive model. The complete-data likelihood is given by 



where p (zjj | 0^, £) is normally distributed according to formula (6). It will be shown below, 
that maximization of (7) becomes easy, due to the fact that the complete-data likelihood 
consists of a product of normal densities. In the exponential family case the stochastic EM 
estimates converge to the maximum likelihood estimates by O (1/n) (Diebolt & Ip, 1995). It 
must be pointed out that the SEM algorithm provides only convergence in distribution and does 
not entail a pointwise estimator, as in the case of the EM algorithm. A pointwise estimator 
can be obtained by avaraging a sufficient number of successive iterations during the estimation 
procedure. The values generated by stochastic EM at the M-step, corresponding to each draw of 
the complete-data, form a Markov chain with a stationary distribution which is approximately 
centred at the maximum likelihood estimates. The sequence of points represents a set of good 



Zij k — CL k 0ij b k “F £ij k i 



( 6 ) 
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guesses, called the plausible region, with respect to values of the missing data. Usually, the 
mean of this stationary distribution is considered as an estimate for the parameters, but the 
point in the plausible region with the largest observed log-likelihood could also be considered 
as an estimate for the parameters. Computation of this estimate requires the extra effort of 
evaluating the observed log-likelihood in every iteration (Diebolt & Ip, 1995). 

Implementation of the SEM Algorithm 

The multilevel IRT model can be set up as a missing data problem by defining 0 and (3 
as unobserved variables. The main interest is estimating the item parameters, £, the regression 
coefficients on Level 2, 7 , and the variance on Level 1 and Level 2, <r 2 and T, respectively. The 
SEM procedure, for current values of the parameters £, 7 , a 2 and T, completes the observed 
data by drawing pseudo-complete data, and then computes the maximum likelihood estimates 
of the parameters based on the completed data. The first step in implementing SEM is creating 
pseudo-complete data. Hence, samples from the joint distribution of 6, (3 | Y, <t 2 , 7 ,T are 
required. Directly drawing a sample from this joint conditional distribution is difficult. It 
turns out to be easier to use the Gibbs sampler (e.g., see, Gelfand & Smith, 1990; Geman & 
Geman, 1984) to simulate independent draws from the joint conditional distribution of 0 and 
(3. Therefore, a continuous latent variable structure is introduced that underlies each binary 
response. A sample from Z, 6, (3 | Y, £, <x 2 , 7 , T is obtained by drawing from the distributions 
p ( z | y, 0 , i) ,p(0 | z, £, /3, a 2 ) andp {(3 \ 0, a 2 , 7 , T). The proposed Gibbs sampler consists 
of three steps. 

First, consider the distribution of p (z | y , 0, £) . This conditional distribution of the 
latent variables Z given 0 , £, Y follow from formula (6) . For the three parameter normal ogive 
model, formula (4) , consider random variables V ijk such that = 1 if a student, indexed ij, 
knows the correct answer to item k and V ijhn = 0 if the student does not know the correct answer 
to item k. The variables formula (6) , are related to the variables V^ fc . That is, several cases 
arise depending on the value ofY ljk . Suppose that Yij k = 0, then Vij k = 0 and Z ijk < 0. Next, 
if Y ijk = 1 and V ijk = 0, then Z ijk > 0. Otherwise if Y ijk = 1 and Z ijk < 0, then V ijk = 1. 
The Gibbs sampling procedure can be extended to obtain a sample from the distribution of the 
underlying dichotomous latent variables Zij k and (Beguin, 2000; Johnson & Albert, 1999). 

Second, the ability parameter 0 , given pseudo-complete data Z, and estimates of 
(£, / 3 , <t 2 ) are independent and distributed as a mixture of normal distributions. From (1) and 
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(6) it follows that, 



P ( Oij | z 0 p (T 2 ) oc p (z ij | p ( 9ij | /3j,o- 2 ) 

« exp^(%-?y) exp — ^ (0 O - Xy/3^) 2 (8) 



with 






Sfc=i °fc + M 

V* a 2 
2_/fc=i a it 



and v = °fc) • ^ foll° ws directly from standard Bayesian results for normally 

distributed variables and a normal prior (e.g., see, Box & Tiao, 1973; Lindley & Smith, 1972) 
that 



, (dii/v + X.j/T/a 2 1 

«» i z fj , «, />,, ~ n l J/ 1/v+1 y/ -. 



ro'i 



Notice that the posterior mean is a composite estimator; as the sampling variance v of 
increases, the relative weight placed on the prior mean, Xy/3^, increases. 

Third, the fully conditional distribution of 0j entails a normal prior induced by the 
Level 2 model and normally distributed observations that is, 



p(0 j | 0 j ,o- 2 ,7,T) a p(0j | 0 j ,a 2 )p(0 j | 7,T) 

« «P (/» - 3,)‘x‘x, ((3 - 9,)) x 

exp (0, - W )7 )‘ T- 1 (0, - W/r)) 



with = (XJXj) 1 X‘0 r Thus 

/9 j |0 j ,a 2 )7 ,T~Ar(Dd,D), (10) 

where Sj = <r 2 (X‘Xj) \ d = ■ + T -1 Wj 7 and D = (E^+T -1 ) . IfXj, 

j = 1, . . . , J, does not have a full column rank, X‘Xj has no inverse and there is no unique 
solution to the normal equations. Besides, if X‘Xj is in the form of a correlation matrix and 
it is not nearly a unit matrix, the least squares estimates are sensitive to errors. Estimators 
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depending on a generalized inverse of X*Xj are not unique because they depend entirely on 
what generalized inverse is used to define the estimator (Searle, 1971). Estimation of /3j based 
on the matrix (X*Xj 4- kI Q+1 ) ,k > 0 has been found to be a procedure that can help to 
circumvent the difficulties associated with the usual least squares estimates (Hoerl & Kennard, 
1970). 

At each step, the fully conditional distributions of Z and 6 are considered at the level 
of persons, samples are drawn for i = 1, . . . , rij, j = 1 , . . . , J. The regression coefficients on 
Level 1 are sampled for each group j. Eventually, a random sample (Z, 6,(3) is obtained after 
sufficient draws from the sequentially updated fully conditional distributions. 

In case of normal components, a more efficient alternative of updating is a block Gibbs 
update (Gelman et al., 1995; Hobert & Geyer, 1998; Roberts & Sahu, 1997). In that case, all 
of the normal components are updated simultaneously. To use this block Gibbs sampler, the 
density of 6, (3 | Z, £, a 2 , 7, T is needed. Treat the regression on the regression parameters, /3, 
on Level 1 as J(Q + 1) prior ‘data points’. The joint fully conditional distribution of 6 V (3 3 
can be deduced from the weighted linear regression of ‘observations’ Z* on (0j,( 3j) , using 
‘explanatory variables’ X* and ‘variance matrix’ E*, where 



' Zj+b' 
0 


,x* = 


’ a <g> I n . 
In, 


0 

-X, 


v* -1 _ 




0 ^ 

1 

b 


0 

0 


1 

1 




0 


I<3+i 




0 


0 


x -i 



It follows that, 

| Z„t-r.T~ N , (11) 

with 

(«y.3,)‘ = (x-'E--x;)" 1 x-'E--z-. 

The proposed Gibbs sampler samples successively from (6) and (11) until a sample 
(Z,0,(3) has been obtained from the simultaneous distribution of (Z,6,(3) given the other 
parameters and the observed data. That is, until convergence of the Gibbs sampler has occurred. 
This completes the stochastic S-step of the SEM algorithm. The attained pseudo-complete data 
(Z,Q,(3) is then used to estimate (£,er 2 ,7,T). Therefore, the M-step entails computing the 
estimates of (£, er 2 , 7, T) . 
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Because, according to (6), the item-parameters depend only on the latent data Z and 
the ability parameters, 0, it follows that 



Z*; = [ 0 — 1 ] + e*;, 

where Z fc = {Z nk , ... , Z nilk , ... , Z njJk f and e k = (e llk , . . . is a random sample 

from AT (0, 1). Therefore, 

= (H t H) _ 1 H t Z fc) (12) 

with H = [ 0 — 1 ] . The £ stands for an estimate of the item parameters based on the pseudo- 
complete data (Z, 0,(3 ) . The estimate exclusively based on the observed data will be marked 
with a hat. The same notation will be used for the other parameters. 

The estimator of the variance on Level 1, a 2 , follows directly from the regression of 0 
on X, with (3 as regression coefficients. Thus, 

s2 =.jrEE(^~ x ‘A) 2 - ,13 > 

j= 1 1=1 

which is the maximum likelihood estimator of a 2 given 0 and (3. 

The Level 2 model for school j can be written as 

(3j = Wj 7 + Uj, (14) 

with E (uj) = 0, E (uju‘) = T. Because (14) is a normal linear model given regression 
coefficients (3j it follows that the generalized least squares estimator of 7 is 

7 = (ii) 

\j=i / j=i 

Likewise it follows that the estimator of T is 

J= 1 

Notice that an Iterative Generalized Least Squares algorithm (Goldstein, 1995) is needed to 
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compute both estimates in formula (15) and (16) . 

In conclusion, the algorithm to estimate all parameters involves iterating two steps. At 
the S-step, the missing data are sampled, given the observed data and a current estimate of the 
parameters. Here the S-step is made up of formula (6) and (11). With use of the Gibbs sampler 
a pseudo-complete sample is drawn. At the M-step, the missing data are imputed to estimate 
all parameters, see formula (12) , (13) , (15) and (16). 

Eventually, plausible values or estimates from the M-step, based on the augmented 
data from the S-step, are used in the estimation of the parameters of interest. Therefore, define 
the parameters of interest A = (£,<7 2 ,7,T) . The array of points generated by SEM are a 
Markov chain, denoted by € n| = \m e n| , where m 

denotes the iteration number. Under very mild conditions, which are easily verified for the 
present model, the Markov chain |a m j is approximately stationary. That is, the stationary 
distribution of j A "* j does not change as m takes on different values. As noted above, usually, 
the mean of the stationary distribution is considered as an estimate of A. That is, after a bum-in 
period of Mo iterations, 

M 

(17) 

u m=Mo+l 

Each step of the SEM algorithm incorporates a stochastic step, which prevents the sequence 
from being immobilized near a saddle point. Therefore, SEM does not terminate in any 
stationary point. 

As noted above, another estimator for the parameters can be derived from the values 
in the plausible region generated at each M-step. This estimator computed from the stochastic 
EM iterates is the point with the largest observed log-likelihood, formula(5), this is, 



(18) 

Obtaining this point requires the calculation of the incomplete log-likelihood in every iteration 
of the stochastic EM algorithm. Gauss-Hermite quadrature can be used to carry out the 
integration over the parameters (6, (3 ) . It is also possible to compute the incomplete likelihood 
via the expected complete likelihood, that is, 



A* = arg max Z (A | y) . 

1 <m<M 
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where Z* represent the augmented data (Z, 6, (3) and k (z* | y, A) is the density of the missing 
data conditional on the observed data. In this case, computing A* via (19) involves a higher 
dimensional integration and is consequently computational more demanding. A rough method 
as Monte Carlo integration of (19) is rather difficult because it needs independent samples of 
the augmented data Z* at every iteration. The point in the plausible region which maximizes the 
observed likelihood is an approximation of the actual maximum likelihood estimator related to 
the observed likelihood, formula (5) . For a sufficient number of stochastic EM iterations, that 
is, for a sufficient number of points in the plausible region A* gets close to the maximum 
likelihood estimator. These points can also be used to check whether the stochastic EM 
estimator, A, approximates the maximum likelihood estimator of formula (5) . 

The variances of the estimators are estimated by the inverse of the observed 
information matrix evaluated at A = A, formula (17) , or at the point with the largest observed 
likelihood A = A*, formula (18). The observed information matrix is easily computed using 
Louis identity which relates the observed-data likelihood and the complete-data likelihood 
(Louis, 1982), that is 



d?l (A; y) 
dXdX 1 



d?l c (A;z*) . 1 




dl c (A; z*) , 1 


d XdX> ly J 


- COV A 


dX |y J 



( 20 ) 



where the expectation is taken with respect to k (z* | y, A) . The right-hand side of (20) is 
computed with augmented data samples generated independently from k (z* | y , A) where A is 
fixed at A or A* . 



Estimating Parameters with SEM in Comparison with the Gibbs Sampling Approach 

It seems worthwhile to compare this implementation of SEM with a fully conditional 
decomposition of the Gelfand and Smith’s (1990) Gibbs sampling, described in Fox and Glas 
(2000). Define the augmented data Z* = (Z, 6 , (3) and the parameters of interest as A. This 
Gibbs sampler generates samples from the following posterior distribution, 

P(*\y) = J J P(A I z*,y)p(z* | A',y)dz*p(A' | y)dA'. (21) 

In fact, the described Gibbs sampler generates samples from the marginal posterior distributions 
of parameters £, a 2 , 7 and T, including priors for the parameters. There are two natural 
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estimates for A following from formula (21) (see, Lehmann & Casella, 1998, pp. 257): 



1 M 



>») 



m=l 



( 22 ) 



and 



1 M 

= E £ ( A iy’ z * (m) )- (23) 

m=l 

Here, A e is called the empirical estimator (Liu et al., 1994) and A m is called the mixture 
estimator. Assuming that the conditional density p (A | z*,y) is simple, the latter is often 
easy to compute. The following difference between these estimates can be noted. The SEM 
estimate, formula (17) , and the mixture estimate resulting from the Gibbs sampler are the mean 
of the expectations of the parameters given the pseudo-complete data, whereas the empirical 
estimate resulting from the Gibbs sampler are the mean of the marginal posterior distributions 
of the parameters. Liu et al. (1994) showed, under mild conditions, that the mixture estimator is 
always better because it has a smaller variance than the empirical estimator. That is, the mixture 
estimator has a smaller variance attributable to the Gibbs sampler in estimating the posterior 
mean. The posterior variances and credibility intervals are estimated from the sampled values 
obtained from the Gibbs sampler. Because the posterior density of A given Z* and Y contains 
a prior for A, formula (21) , it follows that the mixture estimate, formula (23) , differs from the 
SEM estimate, formula (17) . Moreover, the differences between the sampling schemes will 
cause different estimates. 



A Dutch Primary School Language Test 

To compare the SEM algorithm with the MCMC algorithm, a dataset from a Dutch 
primary school language test was analyzed. A multilevel IRT model was estimated with the 
SEM algorithm and the Gibbs sampler. Furthermore, a comparison was made between the 
multilevel IRT model and an hierarchical model with observed scores only. 

This research project entailed investigating whether schools that participate in the 
central primary school leaving test in the Netherlands on a regular basis perform better than 
schools that do not participate on a regular basis. The pupils of 97 schools were given a 
language test for Grade 8 students. In this analysis, 24 items designed by the Netherlands 
National Institute for Educational Measurement (Cito) were used. These items were taken 
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from a standardized Cito test used in most Dutch schools at Grade 8 , called the primary school 
leaving test. The total number of pupils for which data were available was 2156. Schools 
participating in the Cito test (72 schools) on a regular basis are called the Cito schools. The 
remaining 25 schools will be called the non-Cito schools. 

Two students’ characteristics were used as a predictor for the students’ achievement: 
socio-economic status (SES) and non-verbal intelligence measured using the ISI test. The 
SES is based on four indicators: the education and occupation of both parents. Non-verbal 
intelligence was measured in Grade 7 by using three parts of an intelligence test. The predictors 
ISI and SES were normally standardized. A predictor labeled End equaled 1 if the school 
participates in the school leaving test, and equals 0 if this is not the case. A complete description 
of the data can be found in (Doolaard, 1999, pp. 57). 

The structural model used in the analysis is given by, 

&ij = Pqj + /3jISIjj + /3 2 SES ij ■+■ j 

Poj = Too + ToiEndj + u 0j 

01 = Tio 

02 = 720 > 

where e %j - N (0, a 2 ) and u 0j - N (0, r 2 ) . The two-parameter normal ogive model was used 
as the measurement model. 

The following procedure was used to obtain initial estimates. Initial values of the 
item parameters were computed using Bilog-MG (Zimowski et al., 1996). A distinct ability 
distribution was used for every subgroup j. Then the MCMC procedure by Albert (1992) for 
estimating the normal ogive model was run. As the Gibbs sampler had reached convergence 
the means of the sampled values of (Z, 6, £) were computed. An EM algorithm was used for 
estimating (f3, a 2 , 7 , T) with the 6 (see, for instance Bryk & Raudenbush, 1992). 

The number of iterations necessary to reach convergence of the SEM algorithm cannot 
be evaluated simply in a general setting. For the Dutch primary leaving test described above, 
5,000 iterations were ‘enough’ in the sense that after a bum-in period of 1,000 iterations a 
substantial increase in the number of iterations did not perturb the values of ergodic averages. 
Additionally, at every iteration 25 Gibbs sampling steps were taken to generate a sample of the 
pseudo-complete data. The differences in the results were negligible when ranging these Gibbs 
sampling steps between 20 to 75. The fully conditional decomposition of Gibbs sampling as 
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in Fox & Glas (2000) was run for 20,000 iterations, with a bum-in period of 5,000 iterations. 
Non-informative priors were used for the parameters in the Gibbs sampling implementation. 
A non-informative prior for the difficulty and discrimination parameter, insuring that each 
item will have a positive discrimination index, and assuming independence between the 
item difficulty and discrimination parameter leads to the simultaneous noninformative prior 
p (£) a nJLi I ( afc > 0) . A uniform prior was placed on the fixed effects and on the variance 
components, that is, p (7) a c,p ( a 2 ) a l/cr 2 andp (r 2 ) cc 1 /t 2 . 

First, the parameter estimates of the measurements model are considered, after that, 
the parameter estimates of the structural model and further implications of these estimates are 
considered. 

In Table 1 and Table 2, the estimates of the item parameters resulting from the Gibbs 
sampler with the mixture estimator and the SEM algorithm are given. The SEM algorithm 
produces two estimators, the mean of the stationary distribution, formula (17) , labeled under 
the column mean, and the point corresponding to the largest observed likelihood, formula (18) , 
labeled under the column max. The multilevel IRT model was identified by fixing two item- 
parameters, here, a 2 4 = 1 and 624 = 0. 

The col umns labeled SD present the standard deviations of the estimates resulting 
from the SEM algorithm using Louis identity, formula (20). In this application, 100 samples 
of (Z ,0,(3) were obtained to compute the observed information matrix. Unlike the SEM 
estimates, the estimates resulting from the Gibbs sampler are calculated in a Bayesian 
framework. Therefore, the posterior standard deviations of the parameters are denoted by PSD. 
Further, the parameter estimates resulting from the Gibbs sampler are the posterior means. It 
can be seen that the SEM estimates of the item parameters are close to the mixture estimates 
resulting from the Gibbs sampler. Confidence intervals are used to compare the uncertainty 
about the parameter estimates in relation to the different estimators. The Bayesian analogue of 
a frequentist confidence interval is usually referred to as a credibility interval. In the Bayesian 
framework, the central posterior credibility intervals are calculated as confidence regions for the 
parameters. The 95% central posterior credibility intervals are given under the column labeled 
Cl. All SEM estimates are well within the computed central posterior credibility intervals. 
Notable, the posterior standard deviations are, in almost all cases, larger than the standard 
deviations related to the SEM estimates. More detailed information concerning this point will 
be provided later. 

Table 3 presents the results of estimating the fixed effects and random components of 
the model computed with the Gibbs sampler and the stochastic EM algorithm. The main result 
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of the analysis is that conditionally on SES and ISI, the Cito schools perform better than the 
non-Cito schools. The fixed effect, 7 01 , models the contribution of participating in the school 
leaving exam to the ability level of the students via its influence on the intercept (3 0j . This 
intercept /? 0j is defined as the expected achievement of a student in school j when controlling 
for SES and ISI. Thus a positive value of 7 01 indicates a positive effect of participating in the 
school leaving exam to the students’ abilities. Further, there is a highly significant association 
between the Level 1 predictors ISI and SES and the ability of the students. Obviously, students 
with high ISI and SES scores perform better than students with lower scores. The residual 
variance for the school-level, To, is the variance of the achievement of students in school 
j,/3 0j , around the grand mean, 7 00 , when controlling for SES and ISI. Obviously, the use of 
a multilevel model is justified, because a substantial proportion of the variation in the outcome 
at the student level was between the schools. 

In terms of the the SEM and the Gibbs sampling estimates, the fixed and random 
effects are generally quite the same, except for the Level 2 variance, r . The significant 
difference between the Level 2 variance-estimates results in different intraclass correlation 
coefficients. The proportion of variance in ability accounted for by group-membership, after 
controlling for the Level 1 predictor variables is .345 according to the SEM variance-estimates 
and .330 according to the SEM variance-estimates which maximizes the observed likelihood. 
This coefficient is .398 in case of the variance-estimates resulting from the Gibbs sampler. 
As an additional check the fixed effects and variance components are also estimated from the 
observed scores using HLM for windows (Bryk et al., 1996). For comparative purposes, the 
unweighted sums of the item responses were rescaled such that their mean and variance were 
equal to the mean and variance of the posterior estimates of the ability parameters, respectively. 
The standard deviations of the HLM estimates are given under the column labeled SD. The 
estimate of the Level 2 variance component is smaller in the HLM analysis whereas the estimate 
of a is alm ost similar in comparison to the other estimates. The intraclass correlation coefficient 
consisting of these variance components is .301, which is smaller than the estimates of the 
intraclass correlation coefficient from the SEM approach. Furthermore, the estimates of the 
fixed effects are smaller except for the main effect, 7oo- 1 ° conclusion, the multilevel IRT 
analysis, estimated with the Gibbs sampler and SEM, measures a greater variance between 
students’ abilities which results in a larger school-level effect. Further, a sharper distinction in 
students’ achievements is attained. 

The standard deviations of the SEM estimates are larger than the standard deviations 
of the estimates resulting from the analysis in HLM using observed scores. Obviously, the 
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estimates resulting from HLM are based on the observed scores, which results in more accurate 
estimates, that is, the HLM analysis does not take the uncertainty of the ability parameter into 
account. It can be seen from Tables 1 to 3 that in most cases the standard deviations related to 
the stochastic EM estimates are smaller than the posterior standard deviations. This observation 
was also made in Fox & Glas (2000) and Glas, Wainer and Bradlow (2000). It seems that the 
smaller size of the standard deviations in the frequentist framework is related to the fact that 
they are based on an asymptotic approximation that does not take the skewness into account. 

Finally, Figure 1 shows the plausible region of the variance components. The region 
contains the parameter estimates of (<r, r) obtained at every iteration of the stochastic EM 
algorithm. The most central point, that is the mean of ( a , r), corresponds to the stochastic EM 
estimate of (a, r) , formula (17) . The point with the largest observed log-likelihood, formula 
(18) , lies in the circle close to the mean. The points within the circle represent estimates 
of (cr, r) with high observed log-likelihood values, that is, the corresponding log-likelihood 
values are close to each other and therefore close to the highest observed log-likelihood. This 
illustrates the general idea behind stochastic EM. The parameters of interest are estimated by 
taking the mean over all points within the plausible region, where all points correspond to high 
observed log-likelihood values. As a result, this estimate lies close to the maximum likelihood 
estimate, which is checked by computing the observed log-likelihood at every iteration. 

Discussion 

In this article, a stochastic EM algorithm is used to estimate the parameters of a 
multilevel IRT model. The multilevel IRT model which entails treating the dependent ability 
parameters as latent variables in a multilevel model and using an IRT model to model these 
variables has several advantages, such as its realistic treatment of measurement error and the 
application in incomplete designs. Although direct parametric inference is hard because the 
likelihood function is very complex, maximum likelihood estimates can be obtained with the 
stochastic EM algorithm. 

The use of a SEM algorithm for estimating the parameters of a multilevel IRT model 
has several appealing features. First, the algorithm is easy to implement. Second, although the 
am ount of computation involved can be large, the SEM algorithm can handle the numerical 
integrations needed also in cases with more than two levels. Moreover, there are no limitations 
to the number of parameters or the number of explanatory variables. It must be remarked that 
MML or Bayes model estimation procedures are possible but require the calculation of two- 
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dimensional integrals in the case of two levels. The implementation of the Gibbs sampler also 
has no limitations to the number of levels (Fox & Glas, 2000). Moreover, the procedure can 
also be applied to other measurement error models with latent ability parameters. 

The stochastic EM algorithm performs direct inference based on the pseudo-complete 
data whereas the Gibbs sampler samples the entire posterior distributions of the parameters. 
In the comparison presented, both methods gave almost similar results. It must be pointed out 
that the differences between the standard deviations and the posterior standard deviations needs 
further research. 

The convergence of this implementation of the algorithm is slowed down by the Gibbs 
sampling procedure for sampling the pseudo-complete data. The convergence is speeded up 
by the block Gibbs sampler, but a further improvement could be the use of another samplings- 
technique to sample all pseudo-complete data at once. General techniques for simulating draws 
directly from the target density as rejection sampling or importance sampling (Gelman et al., 
1995) could improve the rate of convergence. Furthermore, the number of iterations needed to 
get a stable estimate could be reduced. 
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Figure 1. Plausible region for (a, r) , generated by stochastic EM. 
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Table 1. Parameter estimates of the discrimmation 
sampler. 



parameter with SEM and the Gibbs 




1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 



.856 075 

.654 .066 

.928 .086 
.668 .057 

1.158 086 

1.190 .087 

.297 052 

1.454 .072 

.968 -072 

.972 .073 

.927 .083 

1.019 075 

.738 .060 

1.112 .076 

.746 .062 

.562 .055 

.685 058 

1.042 .062 

1.174 .083 

.977 .071 

.973 080 
.955 071 

1.113 063 

1 0 



.816 .074 

.619 064 

1.038 085 

.631 064 

1.058 .087 

1.165 .085 

.280 056 

1.445 .074 

.894 074 

.912 .072 

.845 .082 

.981 075 

.652 .061 

1.047 .075 

.681 062 
.571 053 

.641 057 

.964 062 

1.050 .084 

.884 .072 

.898 .080 

.909 .072 

.982 .063 

1 0 



.784 
.597 
.870 
.628 
1.089 
1.097 
.265 
1.407 
.911 
.910 
.845 
.960 
.696 
1.055 
.698 
.525 
.647 
1.011 
1.084 
.914 
.881 
.893 
1.081 
1 



.074 
.061 
.096 
.059 
.099 
.091 
.042 
.122 
.078 
.078 
.084 
.088 
.064 
.092 
.066 
.053 
.061 
.087 
.107 
.082 
.075 
.082 
.089 
0 



[ 0 . 646 , 0 . 938 ] 
[ 0 . 485 , 0 . 724 ] 
[ 0 . 698 , 1 . 073 ] 
[ 0 . 520 , 0 . 751 ] 
[ 0 . 906 , 1 . 296 ] 
[ 0 . 927 , 1 . 290 ] 
[ 0 . 186 , 0 . 351 ] 
[ 1 . 186 , 1 . 663 ] 
[ 0 . 767 , 1 . 078 ] 
[ 0 . 765 , 1 . 073 ] 
[ 0 . 691,1 025 ] 
[ 0 . 796 , 1 . 143 ] 
[ 0 . 578 , 0 . 830 ] 
[ 0 . 888 , 1 . 250 ] 
[ 0 . 575 , 0 . 833 ] 
[ 0 . 427 , 0 . 632 ] 
[ 0 . 533 , 0 . 775 ] 
[ 0 . 850 , 1 . 195 ] 
[ 0 . 888 , 1 . 304 ] 
[ 0 . 764 , 1 . 083 ] 
[ 0 . 743 , 1 . 037 ] 
[ 0 . 741 , 1 . 062 ] 
[ 0 . 916,1 265 ] 
11 , 1 ] 
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Table 2. Parameter estimates of the difficulty parameter with SEM and the Gibbs sampler. 



Item 




SEM 






Gibbs Sampler 


mean 

b SD 


max 

b SD 


b 


PSD 


Cl 


1 


—.227 


.049 


-.257 


.045 


-.259 


.044 


[-.341, -.168] 


2 


-.169 


.045 


-.190 


.046 


-.197 


.038 


[-.266, -.119] 


3 


-.843 


.048 


-.836 


.043 


-.870 


.051 


[-.963, -.766] 


4 


.332 


.042 


.315 


.042 


.313 


.040 


[.241,. 396] 


5 


-.281 


.051 


-.284 


.052 


-.312 


.056 


[-.414, -.195] 


6 


.708 


.059 


.733 


.059 


.663 


.060 


[.553, .790] 


7 


.475 


.041 


.444 


.042 


.458 


.031 


[.400, .521] 


8 


-.086 


.048 


-.072 


.044 


-.109 


.069 


[-.234, .035] 


9 


.481 


.049 


.468 


.051 


.455 


.051 


[.362, .560] 


10 


.100 


.047 


.080 


.045 


.073 


.049 


[-.016, .176] 


11 


-.451 


.050 


-.454 


.050 


-.487 


.048 


[-.574, - .388] 


12 


-.222 


.048 


-.207 


.050 


-.249 


.051 


[-.342, -.143] 


13 


.152 


.041 


.121 


.041 


.133 


.042 


[.056, .218] 


14 


.052 


.049 


.031 


.049 


.026 


.055 


[-.072,. 142] 


15 


-.045 


.043 


-.078 


.043 


-.067 


.041 


[-.142, .020] 


16 


.216 


.041 


.233 


.042 


.198 


.035 


[.133, .271] 


17 


.243 


.041 


.223 


.042 


.226 


.040 


[.152, .309] 


18 


.160 


.043 


.126 


.044 


.147 


.054 


[.049, .259] 


19 


-.557 


.052 


-.591 


.050 


-.595 


.056 


[-.698, -.476] 


20 


-.124 


.074 


-.132 


.068 


-.154 


. .049 


[-.244, -.053] 


21 


.289 


.054 


.259 


.055 


.244 


.048 


[.156, .346] 


22 


-.177 


.046 


-.212 


.046 


-.205 


.048 


[-.293, -.105] 


23 


.199 


.043 


.154 


.043 


.184 


.055 


[.083, .299] 


24 


0 


0 


0 


0 


0 


0 


[0,0] 
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Table 3. Parameter estimates of the multilevel model with the Gibbs sampler, SEM and 
HLM using sum scores. 



SEM Gibbs Sampler HLM 



mean max 



Fixed Effects 


Par. 


SD 


Par. 


SD 


Par. 


PSD 


Cl 


Par. 


SD 


7oo 


.334 


.204 


.349 


.197 


.327 


.206 


[-.074, .729] 


.361 


.044 


7oi 


.262 


.237 


.273 


.225 


.277 


.236 


[-.183,.740] 


.223 


.051 


7io 


.184 


.014 


.196 


.013 


.194 


.018 


[.160, .231] 


.156 


.010 


720 


.158 


.014 


.168 


.014 


.168 


.017 


[.136, .204] 


.127 


.011 


Random Effects 


Par. 


SD 


Par. 


SD 


Par. 


PSD 


Cl 


Par. 




a 


.423 


.020 


.439 


.021 


.445 


.027 


[.387, .506] 


.443 




T 


.223 


.010 


.216 


.009 


.294 


.027 


[.222, .390] 


.191 
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